A Brief Summary of Compound Sort Key and Interleaved Sort Key with Performance Tests | Redshift
This post is English translation of the Japanese version. I'm not a native English speaker, so please let me know if you find something like a grammatical error. Cheers!
Have you ever received something like following alerts from your Redshift Advisor?
Initialize Interleaved Sort Keys
Recomendation
Run VACUUM REINDEX, as a superuser, on tables with inactive interleaved sort keys.Replace Single Column Interleaved Sort Keys
Recomendation
Recreate 5 tables to use a single-column compound sort key.
Both recommendations popped up because of using Interleaved Sort Key
. Interleaved Sorting gives equal weight to each column, or a subset of columns, which is very convenient for database designers to use this for a bunch of tables "anyway", but this sometimes has a bad influence on Redshift performance. Although you can refer to Recommendations from your Redshift Advisor which lets you know how to deal with this problem, you need to review overall sort key designs of the tables alerted to find a fundamental solution of the problem.
Today, I'm going to provide a brief summary of characteristics of Compound Sort Key and Interleaved Sort Key and differences between them, and then examine each SQL performance.
Contents
- Characteristics of Compound Sort Key and Interleaved Sort Key
- Conclusion
- Word Definitions and Sample Data
- Examination
- References
Characteristics of Compound Sort Key and Interleaved Sort Key
Compound Sort Key and Interleaved Sort Key have detail (too detail!) characteristics.
Compound Sort Key
Compound sort key is the default sort type in Redshift and made up of one or more of its columns. Basically, Compound sorting is effective with these SQL operations; ORDER BY
, GROUP BY
and PARTITION BY
used in window functions.
When using Compound sorting, you need to consider the order of columns in Compound sort key, the first one is called primary and the second one is called secondary . Querying only by the primary column has a good effect on the speed, while the secondary column and following columns can't be powerful without simultaneously using the primary. For example, when you want to query the following table which has Compound sort key, the primary dateid
can be used only in ORDER BY
operations, but the secondary eventid
is definitely used with the primary dateid
otherwise the sorting performance will be degraded.
CREATE TABLE sales_csk ( salesid integer not null, listid integer not null, sellerid integer not null, buyerid integer not null, eventid integer not null, dateid smallint not null, qtysold smallint not null, pricepaid decimal(8,2), commission decimal(8,2), saletime timestamp ) DISTSTYLE KEY DISTKEY (dateid) COMPOUND SORTKEY (dateid, eventid);
One of the major advantages of Compound sort key is that it enables Merge Join when the following criteria are met. Merge Join is the fastest way to join in Redshift and that advantage can be seen in my examination, too. However, the more complex tables or data are, the harder achieving Merge Join is, because the criteria are very strict. If the criteria are not met, Redshift chooses Hash Join or Nested Loop for the join operator instead.
The criteria of Merge Join
- Used for
INNER JOIN
andOUTER JOIN
, not forFULL JOIN
. - Both join columns are the distribution key and include the primary of Compound sort key.
- Both tables are over 80% sorted.
Also, Compound sort keys help improve compression of the column.
Interleaved Sort Key
On the other hand, Interleaved sort key is also made up of one or more of its columns, but each sort key column has equal importance . That is using only the secondary column can still work. Especially, when filtering with the WHERE
clause, or when the sort key column has a long (more than 8 bytes) common prefix, interleaved sorting will give better performance than compound sorting.
To keep fully benefit of Interleaved sorting, you need to periodically execute VACUUM REINDEX
to the tables. The total time of VACUUM REINDEX
will be longer than that of
VACUUM FULL
because a VACUUM REINDEX
operation requires an extra analysis of the distribution of the values in interleaved columns before performing a VACUUM FULL
operation. If a VACUUM REINDEX
operation can't be included in a daily batch operation, you shouldn't use an interleaved sort key on columns with monotonically increasing attributes, such as id, dates, or timestamps in order to prevent its distribution from getting larger.
It is said that if a table is large enough to require multiple 1MB blocks per slice, Interleaved sorting performs better than Compound sorting. However, in other words, it doesn't have a remarkable effect on not so large tables. If the table has only 10 million rows at most, it is hard to feel the effect of Interleaved sorting because the power of Redshift is enough to perform a high-speed process to such tables. I have an image of the capability of Interleaved sorting working usefully when a table has over 100 million records.
For your information, you can refer to the indicator interleaved_skew
in the view SVV_INTERLEAVED_COLUMNS
to find which table should be run VACUUM REINDEX
on. If the interleaved_skew is greater than 1.4, a VACUUM REINDEX
will usually improve performance.
-- to check the interleaved_skew SELECT DISTINCT sic.tbl as tbl_id, sti.schema as schema_name, stp.name as table_name, sti.tbl_rows, sic.col, sic.interleaved_skew, sic.last_reindex FROM svv_interleaved_columns sic LEFT JOIN svv_table_info sti ON sic.tbl = sti.table_id LEFT JOIN stv_tbl_perm stp ON sic.tbl = stp.id WHERE interleaved_skew > 1.4 ORDER BY schema_name, table_name, col;
Both keys have pros and cons and very complex features as mentioned above. It is hard to apply appropriately either key for each different use case without any examinations. So, I'm going to execute simple queries and make a comparison between both keys with SQL performance.
Conclusion
I'm going to tell the conclusion first because examinations are quite many. If you are operating a database for batch processing and handling less than 100 million records in a table, you want to basically use Compound Sort Key. Consider doing the following things as part of sort key design:
- Make adjustments to the relationships between the columns used as a primary or secondary key and the ones used in JOIN or WHERE queries.
- If a table is using JOIN operation in batch processing, confirm whether the joining column can be also defined as the distribution key or not. If possible, the fastest join operator, Merge Join will be selected.
- Operating compound-key-based database, if there is a bottleneck at a certain table in batch processing, you want to apply Interleaved Sort Key to it instead.
In case that a table has over than 100 million records, the capability of Interleaved sorting becomes more effective. There is no absolute way of thinking on such a complex case, so you are required to be more attentive to which sort key should be used. Consider the following:
- Improve performance by choosing Interleaved sort key for columns which are not the primary key and frequently used independently.
- Compound sorting is the fastest in a large number of cases, but sometimes it's going to be worse than Interleaved sorting. Interleaved sorting works not too good and not too bad on average.
- Pay attention to Data Redistribution in Query Plan.
- Avoid using Interleaved sort key for all of the tables because you need to perform
VACUUM REINDEX
to all huge tables. It will cost a lot. - If possible, Use
COPY
instead ofINSERT
to load data for Interleaved sort key tables. A deep copy automatically creates interleaved indexes in addition to the data load. But other concurrent updates cannot be made during a deep copy. - etc...
Please note this is not the absolute principle for any tables. Also, this principle is effective for tables which are used in batch processing and queried with complex SQL. If tables are referenced by BI clients, you should design sort keys for frequently used SQL.
Word Definitions and Sample Data
In the examination, I created sample tables with each key; No sort key, Compound sort key and Interleaved sort key, and compared them in terms of execution time and Query Plan. The word definitions in Query Plan are the following things:
Cost : A relative value that is useful for comparing operations within a plan. The first value of cost separated by two periods is the relative cost of returning the first row for this operation and the second value is the relative cost of completing the operation . (ex. cost=131.97..133.41
)
Join Operators : Redshift automatically selects these operators in join operations. This is based on the physical design of the tables, such as key design and data sorting. There are three types of operators, Nested Loop, Hash Join and Merge Join, and Merge Join is the fastest in them but hard to be achieved.
Aggregate Operators : Redshift automatically selects these operators in an aggregate function and a GROUP BY
operation. A GROUP BY
operation has two types of operators, HashAggregate and GroupAggregate, and GroupAggregate will be selected when an aggregate function is performed with sorting.
Data Redistribution : This is a method for how data is moved around a cluster to facilitate the join. There are seven types of attributes in query plans and less data movement is desirable.
Sample tables were created from Step 6: Load Sample Data from Amazon S3 - Amazon Redshift and Step 1: Create a Test Data Set - Amazon Redshift.
-- 365rows CREATE TABLE date( dateid smallint not null, caldate date not null, day character(3) not null, week smallint not null, month character(5) not null, qtr character(5) not null, year smallint not null, holiday boolean default('N') ) DISTSTYLE KEY DISTKEY(dateid) No sort key | COMPOUND SORTKEY (dateid) | INTERLEAVED SORTKEY (dateid) ; -- 8,798rows CREATE TABLE event( eventid integer not null, venueid smallint not null, catid smallint not null, dateid smallint not null, eventname varchar(200), starttime timestamp ) DISTSTYLE KEY DISTKEY(eventid) No sort key | COMPOUND SORTKEY (eventid) | INTERLEAVED SORTKEY (eventid) ; -- 172,456rows CREATE TABLE sales( salesid integer not null, listid integer not null, sellerid integer not null, buyerid integer not null, eventid integer not null, dateid smallint not null, qtysold smallint not null, pricepaid decimal(8,2), commission decimal(8,2), saletime timestamp ) DISTSTYLE KEY DISTKEY(dateid) No sort key | COMPOUND SORTKEY (dateid, eventid) | COMPOUND SORTKEY (eventid, dateid) | INTERLEAVED SORTKEY (dateid, eventid) ; -- 1,400,000rows CREATE TABLE part( p_partkey INTEGER NOT NULL, p_name VARCHAR(22) NOT NULL, p_mfgr VARCHAR(6) NOT NULL, p_category VARCHAR(7) NOT NULL, p_brand1 VARCHAR(9) NOT NULL, p_color VARCHAR(11) NOT NULL, p_type VARCHAR(25) NOT NULL, p_size INTEGER NOT NULL, p_container VARCHAR(10) NOT NULL ) DISTSTYLE KEY DISTKEY(p_partkey) No sort key | COMPOUND SORTKEY (p_partkey, p_size) | INTERLEAVED SORTKEY (p_partkey, p_size) ; -- 3,000,000rows CREATE TABLE customer( c_custkey INTEGER NOT NULL, c_name VARCHAR(25) NOT NULL, c_address VARCHAR(25) NOT NULL, c_city VARCHAR(10) NOT NULL, c_nation VARCHAR(15) NOT NULL, c_region VARCHAR(12) NOT NULL, c_phone VARCHAR(15) NOT NULL, c_mktsegment VARCHAR(10) NOT NULL ) DISTSTYLE KEY DISTKEY(c_custkey) No sort key | COMPOUND SORTKEY (c_custkey, c_region) | INTERLEAVED SORTKEY (c_custkey, c_region) ; -- 600,037,902rows CREATE TABLE lineorder( lo_orderkey INTEGER NOT NULL, lo_linenumber INTEGER NOT NULL, lo_custkey INTEGER NOT NULL, lo_partkey INTEGER NOT NULL, lo_suppkey INTEGER NOT NULL, lo_orderdate INTEGER NOT NULL, lo_orderpriority VARCHAR(15) NOT NULL, lo_shippriority VARCHAR(1) NOT NULL, lo_quantity INTEGER NOT NULL, lo_extendedprice INTEGER NOT NULL, lo_ordertotalprice INTEGER NOT NULL, lo_discount INTEGER NOT NULL, lo_revenue INTEGER NOT NULL, lo_supplycost INTEGER NOT NULL, lo_tax INTEGER NOT NULL, lo_commitdate INTEGER NOT NULL, lo_shipmode VARCHAR(10) NOT NULL ) DISTSTYLE KEY DISTKEY(c_custkey) No sort key | COMPOUND SORTKEY (lo_custkey, lo_partkey) | COMPOUND SORTKEY2 (lo_partkey, lo_custkey) | INTERLEAVED SORTKEY (lo_custkey, lo_partkey) ; /* nsk: No sort key csk: COMPOUND SORTKEY csk2: COMPOUND SORTKEY (The primary and the secondary are reversed) ilsk: INTERLEAVED SORTKEY */ copy date_nsk from 's3://awssampledbuswest2/tickit/date2008_pipe.txt' credentials 'aws_iam_role=<iam-role-arn>' delimiter '|' region 'us-west-2'; INSERT INTO date_csk SELECT * FROM date_nsk; INSERT INTO date_ilsk SELECT * FROM date_nsk; copy event_nsk from 's3://awssampledbuswest2/tickit/allevents_pipe.txt' credentials 'aws_iam_role=<iam-role-arn>' delimiter '|' timeformat 'YYYY-MM-DD HH:MI:SS' region 'us-west-2'; INSERT INTO event_csk SELECT * FROM event_nsk; INSERT INTO event_ilsk SELECT * FROM event_nsk; copy sales_nsk from 's3://awssampledbuswest2/tickit/sales_tab.txt' credentials 'aws_iam_role=<iam-role-arn>' delimiter '\t' timeformat 'MM/DD/YYYY HH:MI:SS' region 'us-west-2'; INSERT INTO sales_csk SELECT * FROM sales_nsk; INSERT INTO sales_csk2 SELECT * FROM sales_nsk; INSERT INTO sales_ilsk SELECT * FROM sales_nsk; copy customer_nsk from 's3://awssampledbuswest2/ssbgz/customer' credentials 'aws_iam_role=<iam-role-arn>' gzip compupdate off region 'us-west-2'; INSERT INTO customer_csk SELECT * FROM customer_nsk; INSERT INTO customer_ilsk SELECT * FROM customer_nsk; copy lineorder_nsk from 's3://awssampledbuswest2/ssbgz/lineorder' credentials 'aws_iam_role=<iam-role-arn>' gzip compupdate off region 'us-west-2'; INSERT INTO lineorder_csk SELECT * FROM lineorder_nsk; INSERT INTO lineorder_csk2 SELECT * FROM lineorder_nsk; INSERT INTO lineorder_ilsk SELECT * FROM lineorder_nsk;
After INSERT
, I operated VACUUM FULL
or VACUUM REINDEX
, and ANALYZE
to the tables. Ensure that the result caching has been disabled before the examination. ( Using SET enable_result_cache_for_session = off;
)
Examination
ORDER BY
-- Less than 10 million records table -- ORDER BY with the primary key SELECT dateid FROM sales ORDER BY dateid; -- EXPLAIN XN Merge (cost=1000000016724.67..1000000017155.81 rows=172456 width=2) Merge Key: dateid -> XN Network (cost=1000000016724.67..1000000017155.81 rows=172456 width=2) Send to leader -> XN Sort (cost=1000000016724.67..1000000017155.81 rows=172456 width=2) Sort Key: dateid -> XN Seq Scan on sales_nsk (cost=0.00..1724.56 rows=172456 width=2) XN Merge (cost=0.00..1724.56 rows=172456 width=2) Merge Key: dateid -> XN Network (cost=0.00..1724.56 rows=172456 width=2) Send to leader -> XN Seq Scan on sales_csk (cost=0.00..1724.56 rows=172456 width=2) XN Merge (cost=1000000016724.67..1000000017155.81 rows=172456 width=2) Merge Key: dateid -> XN Network (cost=1000000016724.67..1000000017155.81 rows=172456 width=2) Send to leader -> XN Sort (cost=1000000016724.67..1000000017155.81 rows=172456 width=2) Sort Key: dateid -> XN Seq Scan on sales_ilsk (cost=0.00..1724.56 rows=172456 width=2)
# | No sort key | Compound sort key | Interleaved sort key |
---|---|---|---|
1 | 398ms | 351ms | 356ms |
2 | 376ms | 350ms | 362ms |
3 | 355ms | 344ms | 354ms |
Cost | 1000000016724.67.. 1000000017155.81 |
0.00..1724.56 | 1000000016724.67.. 1000000017155.81 |
Execution times are almost all the same. The cost of Compound sorting is the lowest.
-- Less than 10 million records table -- ORDER BY with the secondary key SELECT eventid FROM sales ORDER BY eventid; -- EXPLAIN XN Merge (cost=1000000016724.67..1000000017155.81 rows=172456 width=4) Merge Key: eventid -> XN Network (cost=1000000016724.67..1000000017155.81 rows=172456 width=4) Send to leader -> XN Sort (cost=1000000016724.67..1000000017155.81 rows=172456 width=4) Sort Key: eventid -> XN Seq Scan on sales_nsk (cost=0.00..1724.56 rows=172456 width=4) XN Merge (cost=1000000016724.67..1000000017155.81 rows=172456 width=4) Merge Key: eventid -> XN Network (cost=1000000016724.67..1000000017155.81 rows=172456 width=4) Send to leader -> XN Sort (cost=1000000016724.67..1000000017155.81 rows=172456 width=4) Sort Key: eventid -> XN Seq Scan on sales_csk (cost=0.00..1724.56 rows=172456 width=4) XN Merge (cost=1000000016724.67..1000000017155.81 rows=172456 width=4) Merge Key: eventid -> XN Network (cost=1000000016724.67..1000000017155.81 rows=172456 width=4) Send to leader -> XN Sort (cost=1000000016724.67..1000000017155.81 rows=172456 width=4) Sort Key: eventid -> XN Seq Scan on sales_ilsk (cost=0.00..1724.56 rows=172456 width=4)
# | No sort key | Compound sort key | Interleaved sort key |
---|---|---|---|
1 | 362ms | 346ms | 355ms |
2 | 363ms | 352ms | 349ms |
3 | 355ms | 365ms | 364ms |
Cost | 1000000016724.67.. 1000000017155.81 |
1000000016724.67.. 1000000017155.81 |
1000000016724.67.. 1000000017155.81 |
Both execution times and costs are almost all the same.
-- Less than 10 million records table -- ORDER BY with the primary key and the secondary key SELECT dateid, eventid FROM sales ORDER BY dateid, eventid; -- EXPLAIN XN Merge (cost=1000000016724.67..1000000017155.81 rows=172456 width=6) Merge Key: dateid, eventid -> XN Network (cost=1000000016724.67..1000000017155.81 rows=172456 width=6) Send to leader -> XN Sort (cost=1000000016724.67..1000000017155.81 rows=172456 width=6) Sort Key: dateid, eventid -> XN Seq Scan on sales_nsk (cost=0.00..1724.56 rows=172456 width=6) XN Merge (cost=0.00..1724.56 rows=172456 width=6) Merge Key: dateid, eventid -> XN Network (cost=0.00..1724.56 rows=172456 width=6) Send to leader -> XN Seq Scan on sales_csk (cost=0.00..1724.56 rows=172456 width=6) XN Merge (cost=1000000016724.67..1000000017155.81 rows=172456 width=6) Merge Key: dateid, eventid -> XN Network (cost=1000000016724.67..1000000017155.81 rows=172456 width=6) Send to leader -> XN Sort (cost=1000000016724.67..1000000017155.81 rows=172456 width=6) Sort Key: dateid, eventid -> XN Seq Scan on sales_ilsk (cost=0.00..1724.56 rows=172456 width=6)
# | No sort key | Compound sort key | Interleaved sort key |
---|---|---|---|
1 | 438ms | 417ms | 424ms |
2 | 437ms | 425ms | 439ms |
3 | 445ms | 431ms | 449ms |
Cost | 1000000016724.67.. 1000000017155.81 |
0.00..1724.56 | 1000000016724.67.. 1000000017155.81 |
Execution times are all the same. The cost of Compound sorting is the lowest.
-- More than 10 million records table -- ORDER BY with the primary key SELECT lo_custkey FROM lineorder ORDER BY lo_custkey LIMIT 10000; -- EXPLAIN XN Merge (cost=1000093487338.12..1000094987432.84 rows=600037888 width=4) Merge Key: lo_custkey -> XN Network (cost=1000093487338.12..1000094987432.84 rows=600037888 width=4) Send to leader -> XN Sort (cost=1000093487338.12..1000094987432.84 rows=600037888 width=4) Sort Key: lo_custkey -> XN Seq Scan on lineorder_nsk (cost=0.00..6000378.88 rows=600037888 width=4) XN Merge (cost=0.00..6000378.88 rows=600037888 width=4) Merge Key: lo_custkey -> XN Network (cost=0.00..6000378.88 rows=600037888 width=4) Send to leader -> XN Seq Scan on lineorder_csk (cost=0.00..6000378.88 rows=600037888 width=4) XN Merge (cost=1000093487338.12..1000094987432.84 rows=600037888 width=4) Merge Key: lo_custkey -> XN Network (cost=1000093487338.12..1000094987432.84 rows=600037888 width=4) Send to leader -> XN Sort (cost=1000093487338.12..1000094987432.84 rows=600037888 width=4) Sort Key: lo_custkey -> XN Seq Scan on lineorder_ilsk (cost=0.00..6000378.88 rows=600037888 width=4)
# | No sort key | Compound sort key | Interleaved sort key |
---|---|---|---|
1 | 1.8s | 97ms | 1.4s |
2 | 1.7s | 93ms | 1.5s |
3 | 1.7s | 56ms | 1.5s |
Cost | 1000093487338.12.. 1000094987432.84 |
0.00..6000378.88 | 1000093487338.12.. 1000094987432.84 |
The execution time of Compound sorting is the fastest and the one of Interleaved sorting is the next-fastest. The cost of Compound sorting is the lowest.
-- More than 10 million records table -- ORDER BY with the secondary key SELECT lo_partkey FROM lineorder ORDER BY lo_partkey LIMIT 10000; -- EXPLAIN XN Merge (cost=1000093487338.12..1000094987432.84 rows=600037888 width=4) Merge Key: lo_partkey -> XN Network (cost=1000093487338.12..1000094987432.84 rows=600037888 width=4) Send to leader -> XN Sort (cost=1000093487338.12..1000094987432.84 rows=600037888 width=4) Sort Key: lo_partkey -> XN Seq Scan on lineorder_nsk (cost=0.00..6000378.88 rows=600037888 width=4) XN Merge (cost=1000093487338.12..1000094987432.84 rows=600037888 width=4) Merge Key: lo_partkey -> XN Network (cost=1000093487338.12..1000094987432.84 rows=600037888 width=4) Send to leader -> XN Sort (cost=1000093487338.12..1000094987432.84 rows=600037888 width=4) Sort Key: lo_partkey -> XN Seq Scan on lineorder_csk (cost=0.00..6000378.88 rows=600037888 width=4) XN Merge (cost=1000093487338.12..1000094987432.84 rows=600037888 width=4) Merge Key: lo_partkey -> XN Network (cost=1000093487338.12..1000094987432.84 rows=600037888 width=4) Send to leader -> XN Sort (cost=1000093487338.12..1000094987432.84 rows=600037888 width=4) Sort Key: lo_partkey -> XN Seq Scan on lineorder_ilsk (cost=0.00..6000378.88 rows=600037888 width=4)
# | No sort key | Compound sort key | Interleaved sort key |
---|---|---|---|
1 | 1.9s | 1.4s | 1.5s |
2 | 1.7s | 1.4s | 1.4s |
3 | 1.8s | 1.5s | 1.5s |
Cost | 1000093487338.12.. 1000094987432.84 |
1000093487338.12.. 1000094987432.84 |
1000093487338.12.. 1000094987432.84 |
In terms of execution times, there is no differences between Compound sorting and Interleaved sorting and both are faster than no sort key. The costs are almost all the same.
-- More than 10 million records table -- ORDER BY with the primary key and the secondary key SELECT lo_custkey, lo_partkey FROM lineorder ORDER BY lo_custkey, lo_partkey LIMIT 10000; -- EXPLAIN XN Merge (cost=1000093487338.12..1000094987432.84 rows=600037888 width=8) Merge Key: lo_custkey, lo_partkey -> XN Network (cost=1000093487338.12..1000094987432.84 rows=600037888 width=8) Send to leader -> XN Sort (cost=1000093487338.12..1000094987432.84 rows=600037888 width=8) Sort Key: lo_custkey, lo_partkey -> XN Seq Scan on lineorder_nsk (cost=0.00..6000378.88 rows=600037888 width=8) XN Merge (cost=0.00..6000378.88 rows=600037888 width=8) Merge Key: lo_custkey, lo_partkey -> XN Network (cost=0.00..6000378.88 rows=600037888 width=8) Send to leader -> XN Seq Scan on lineorder_csk (cost=0.00..6000378.88 rows=600037888 width=8) XN Merge (cost=1000093487338.12..1000094987432.84 rows=600037888 width=8) Merge Key: lo_custkey, lo_partkey -> XN Network (cost=1000093487338.12..1000094987432.84 rows=600037888 width=8) Send to leader -> XN Sort (cost=1000093487338.12..1000094987432.84 rows=600037888 width=8) Sort Key: lo_custkey, lo_partkey -> XN Seq Scan on lineorder_ilsk (cost=0.00..6000378.88 rows=600037888 width=8)
# | No sort key | Compound sort key | Interleaved sort key |
---|---|---|---|
1 | 2.3s | 69ms | 1.7s |
2 | 2.4s | 66ms | 1.8s |
3 | 2.3s | 61ms | 1.7s |
Cost | 1000093487338.12.. 1000094987432.84 |
0.00..6000378.88 | 1000093487338.12.. 1000094987432.84 |
The result is the same as "ORDER BY with the primary key", but the difference between them is more obvious. The cost of Compound sorting is the lowest.
GROUP BY
-- Less than 10 million records table -- GROUP BY with the primary key SELECT dateid, count(*) FROM sales GROUP BY dateid; -- EXPLAIN XN HashAggregate (cost=2586.84..2587.75 rows=364 width=2) -> XN Seq Scan on sales_nsk (cost=0.00..1724.56 rows=172456 width=2) XN GroupAggregate (cost=0.00..2587.71 rows=350 width=2) -> XN Seq Scan on sales_csk (cost=0.00..1724.56 rows=172456 width=2) XN HashAggregate (cost=2586.84..2587.75 rows=363 width=2) -> XN Seq Scan on sales_ilsk (cost=0.00..1724.56 rows=172456 width=2)
# | No sort key | Compound sort key | Interleaved sort key |
---|---|---|---|
1 | 22ms | 21ms | 23ms |
2 | 24ms | 22ms | 24ms |
3 | 22ms | 21ms | 22ms |
Cost | 2586.84..2587.75 | 0.00..2587.71 | 2586.84..2587.75 |
Aggregate Operators | HashAggregate | GroupAggregate | HashAggregate |
Execution times are almost the same. The initiate cost of Compound sorting is lower than the others, but the total ones are all the same.
-- Less than 10 million records table -- GROUP BY with the secondary key SELECT eventid, count(*) FROM sales GROUP BY eventid; -- EXPLAIN XN HashAggregate (cost=2586.84..2606.31 rows=7787 width=4) -> XN Seq Scan on sales_nsk (cost=0.00..1724.56 rows=172456 width=4) XN HashAggregate (cost=2586.84..2606.42 rows=7834 width=4) -> XN Seq Scan on sales_csk (cost=0.00..1724.56 rows=172456 width=4) XN HashAggregate (cost=2586.84..2606.27 rows=7772 width=4) -> XN Seq Scan on sales_ilsk (cost=0.00..1724.56 rows=172456 width=4)
# | No sort key | Compound sort key | Interleaved sort key |
---|---|---|---|
1 | 61ms | 62ms | 53ms |
2 | 53ms | 57ms | 60ms |
3 | 48ms | 48ms | 49ms |
Cost | 2586.84..2606.31 | 2586.84..2606.42 | 2586.84..2606.27 |
Aggregate Operators | HashAggregate | HashAggregate | HashAggregate |
Both execution times and costs are almost all the same.
-- Less than 10 million records table -- GROUP BY with the primary key and the secondary key SELECT dateid, eventid, count(*) FROM sales GROUP BY dateid, eventid; -- EXPLAIN XN HashAggregate (cost=3017.98..3061.09 rows=17246 width=6) -> XN Seq Scan on sales_nsk (cost=0.00..1724.56 rows=172456 width=6) XN GroupAggregate (cost=0.00..3061.09 rows=17246 width=6) -> XN Seq Scan on sales_csk (cost=0.00..1724.56 rows=172456 width=6) XN HashAggregate (cost=3017.98..3061.09 rows=17246 width=6) -> XN Seq Scan on sales_ilsk (cost=0.00..1724.56 rows=172456 width=6)
# | No sort key | Compound sort key | Interleaved sort key |
---|---|---|---|
1 | 551ms | 437ms | 430ms |
2 | 683ms | 411ms | 586ms |
3 | 439ms | 659ms | 438ms |
Cost | 3017.98..3061.09 | 0.00..3061.09 | 3017.98..3061.09 |
Aggregate Operators | HashAggregate | GroupAggregate | HashAggregate |
Execution times are almost all the same. The initiate cost of Compound sorting is lower than the others, but the total ones are all the same.
-- More than 10 million records table -- GROUP BY with the primary key SELECT lo_custkey, count(*) FROM lineorder GROUP BY lo_custkey; -- EXPLAIN XN HashAggregate (cost=9000568.32..9004948.86 rows=1752214 width=4) -> XN Seq Scan on lineorder_nsk (cost=0.00..6000378.88 rows=600037888 width=4) XN GroupAggregate (cost=0.00..9005054.16 rows=1794335 width=4) -> XN Seq Scan on lineorder_csk (cost=0.00..6000378.88 rows=600037888 width=4) XN HashAggregate (cost=9000568.32..9005161.02 rows=1837079 width=4) -> XN Seq Scan on lineorder_ilsk (cost=0.00..6000378.88 rows=600037888 width=4)
# | No sort key | Compound sort key | Interleaved sort key |
---|---|---|---|
1 | 13.2s | 4.6s | 8.5s |
2 | 13.3s | 4.6s | 8.3s |
3 | 13.3s | 4.6s | 8.4s |
Cost | 9000568.32.. 9004948.86 |
0.00.. 9005054.16 |
9000568.32.. 9005161.02 |
Aggregate Operators | HashAggregate | GroupAggregate | HashAggregate |
The execution time of Compound sorting is the fastest and the one of Interleaved sorting is the next-fastest. In terms of total costs, Interleaved sorting is lower than Compound sorting.
-- More than 10 million records table -- GROUP BY with the secondary key SELECT lo_partkey, count(*) FROM lineorder GROUP BY lo_partkey; -- EXPLAIN XN HashAggregate (cost=9000568.32..9002981.02 rows=965078 width=4) -> XN Seq Scan on lineorder_nsk (cost=0.00..6000378.88 rows=600037888 width=4) XN HashAggregate (cost=9000568.32..9003137.57 rows=1027699 width=4) -> XN Seq Scan on lineorder_csk (cost=0.00..6000378.88 rows=600037888 width=4) XN HashAggregate (cost=9000568.32..9003314.96 rows=1098655 width=4) -> XN Seq Scan on lineorder_ilsk (cost=0.00..6000378.88 rows=600037888 width=4)
# | No sort key | Compound sort key | Interleaved sort key |
---|---|---|---|
1 | 17.4s | 15.8s | 7.4s |
2 | 17.2s | 15.3s | 7.3s |
3 | 16.3s | 16.1s | 7.3s |
Cost | 9000568.32.. 9002981.02 |
9000568.32.. 9003137.57 |
9000568.32.. 9003314.96 |
Aggregate Operators | HashAggregate | HashAggregate | HashAggregate |
Finally, we can obviously see the advantage of Interleaved sorting in this pattern, "More than 10 million records table" and "GROUP BY with the secondary key". Although the cost of Interleaved sorting is the highest, the execution time is the fastest.
-- More than 10 million records table -- GROUP BY with the primary key and the secondary key SELECT lo_custkey, lo_partkey, count(*) FROM lineorder GROUP BY lo_custkey, lo_partkey LIMIT 10000; -- EXPLAIN XN HashAggregate (cost=10500663.04..10650672.51 rows=60003789 width=8) -> XN Seq Scan on lineorder_nsk (cost=0.00..6000378.88 rows=600037888 width=8) XN GroupAggregate (cost=0.00..10650672.51 rows=60003789 width=8) -> XN Seq Scan on lineorder_csk (cost=0.00..6000378.88 rows=600037888 width=8) XN HashAggregate (cost=10500663.04..10650672.51 rows=60003789 width=8) -> XN Seq Scan on lineorder_ilsk (cost=0.00..6000378.88 rows=600037888 width=8)
# | No sort key | Compound sort key | Interleaved sort key |
---|---|---|---|
1 | 37.7s | 60ms | 33.7s |
2 | 36.8s | 67ms | 35.1s |
3 | 38.0s | 62ms | 34.6s |
Cost | 10500663.04.. 10650672.51 |
0.00.. 10650672.51 |
10500663.04.. 10650672.51 |
Aggregate Operators | HashAggregate | GroupAggregate | HashAggregate |
The performance of Interleaved sorting is getting extremely bad when GROUP BY is executed with the primary and the secondary. On the other hand, the performance of Compound sorting is very good. The initiate cost of Compound sorting is lower than the others, but the total ones are all the same.
WHERE
-- Less than 10 million records table -- WHERE with the primary key SELECT dateid FROM sales WHERE dateid IN ('2000', '2050', '1850'); -- EXPLAIN XN Seq Scan on sales_nsk (cost=0.00..3017.98 rows=1395 width=2) Filter: ((dateid = 1850::smallint) OR (dateid = 2000::smallint) OR (dateid = 2050::smallint)) XN Seq Scan on sales_csk (cost=0.00..24.44 rows=1397 width=2) Filter: ((dateid = 1850::smallint) OR (dateid = 2000::smallint) OR (dateid = 2050::smallint)) XN Seq Scan on sales_ilsk (cost=0.00..3017.98 rows=1398 width=2) Filter: ((dateid = 1850::smallint) OR (dateid = 2000::smallint) OR (dateid = 2050::smallint))
# | No sort key | Compound sort key | Interleaved sort key |
---|---|---|---|
1 | 32ms | 35ms | 30ms |
2 | 25ms | 33ms | 27ms |
3 | 27ms | 27ms | 30ms |
Cost | 0.00..3017.98 | 0.00..24.44 | 0.00..3017.98 |
Execution times are all the same. The cost of Compound sorting is the lowest.
-- Less than 10 million records table -- WHERE with the secondary key SELECT eventid FROM sales WHERE eventid BETWEEN 10 AND 1000; -- EXPLAIN XN Seq Scan on sales_nsk (cost=0.00..2586.84 rows=19865 width=4) Filter: ((eventid <= 1000) AND (eventid >= 10)) XN Seq Scan on sales_csk (cost=0.00..2586.84 rows=19710 width=4) Filter: ((eventid <= 1000) AND (eventid >= 10)) XN Seq Scan on sales_ilsk (cost=0.00..2586.84 rows=19153 width=4) Filter: ((eventid <= 1000) AND (eventid >= 10))
# | No sort key | Compound sort key | Interleaved sort key |
---|---|---|---|
1 | 73ms | 73ms | 75ms |
2 | 72ms | 70ms | 69ms |
3 | 65ms | 63ms | 67ms |
Cost | 0.00..2586.84 | 0.00..2586.84 | 0.00..2586.84 |
Both execution times and costs are almost all the same.
-- Less than 10 million records table -- WHERE with the primary key and the secondary key SELECT dateid, eventid FROM sales WHERE dateid BETWEEN 1000 AND 2000 AND eventid BETWEEN 10 AND 1000; -- EXPLAIN XN Seq Scan on sales_nsk (cost=0.00..3449.12 rows=9480 width=6) Filter: ((eventid <= 1000) AND (eventid >= 10) AND (dateid <= 2000) AND (dateid >= 1000)) XN Seq Scan on sales_csk (cost=0.00..1639.04 rows=9366 width=6) Filter: ((eventid <= 1000) AND (eventid >= 10) AND (dateid <= 2000) AND (dateid >= 1000)) XN Seq Scan on sales_ilsk (cost=0.00..3449.12 rows=9093 width=6) Filter: ((eventid <= 1000) AND (eventid >= 10) AND (dateid <= 2000) AND (dateid >= 1000))
# | No sort key | Compound sort key | Interleaved sort key |
---|---|---|---|
1 | 64ms | 48ms | 65ms |
2 | 53ms | 49ms | 51ms |
3 | 46ms | 55ms | 54ms |
Cost | 0.00..3449.12 | 0.00..1639.04 | 0.00..3449.12 |
Execution times are all the same. The cost of Compound sorting is the lowest.
-- More than 10 million records table -- WHERE with the primary key SELECT lo_custkey FROM lineorder WHERE lo_custkey IN ('123434', '123', '3415678'); -- EXPLAIN XN Seq Scan on lineorder_nsk (cost=0.00..10500663.04 rows=1028 width=4) Filter: ((lo_custkey = 123) OR (lo_custkey = 123434) OR (lo_custkey = 3415678)) XN Seq Scan on lineorder_csk (cost=0.00..17.56 rows=1004 width=4) Filter: ((lo_custkey = 123) OR (lo_custkey = 123434) OR (lo_custkey = 3415678)) XN Seq Scan on lineorder_ilsk (cost=0.00..10500663.04 rows=980 width=4) Filter: ((lo_custkey = 123) OR (lo_custkey = 123434) OR (lo_custkey = 3415678))
# | No sort key | Compound sort key | Interleaved sort key |
---|---|---|---|
1 | 1.4s | 22ms | 143ms |
2 | 1.3s | 25ms | 135m |
3 | 1.3s | 24ms | 134ms |
Cost | 0.00..10500663.04 | 0.00..17.56 | 0.00..10500663.04 |
The execution time of Compound sorting is the fastest and the one of Interleaved sorting is the next-fastest. The cost of Compound sorting is the lowest.
-- More than 10 million records table -- WHERE with the secondary key SELECT lo_partkey FROM lineorder WHERE lo_partkey BETWEEN 500 AND 1000; -- EXPLAIN XN Seq Scan on lineorder_nsk (cost=0.00..9000568.32 rows=309378 width=4) Filter: ((lo_partkey <= 1000) AND (lo_partkey >= 500)) XN Seq Scan on lineorder_csk (cost=0.00..9000568.32 rows=308579 width=4) Filter: ((lo_partkey <= 1000) AND (lo_partkey >= 500)) XN Seq Scan on lineorder_ilsk (cost=0.00..9000568.32 rows=305233 width=4) Filter: ((lo_partkey <= 1000) AND (lo_partkey >= 500))
# | No sort key | Compound sort key | Interleaved sort key |
---|---|---|---|
1 | 1.3s | 2.3s | 575ms |
2 | 1.3s | 1.8s | 579ms |
3 | 1.3s | 2.3s | 594ms |
Cost | 0.00..9000568.32 | 0.00..9000568.32 | 0.00..9000568.32 |
As well as GROUP BY, we can see the advantage of Interleaved sorting in this pattern. The execution time of Compound sort key is slower than the one of No sort key. The costs are all the same.
-- More than 10 million records table -- WHERE with the primary key and the secondary key SELECT lo_custkey, lo_partkey FROM lineorder WHERE lo_custkey BETWEEN 500 AND 2000 AND lo_partkey BETWEEN 500 AND 1000; -- EXPLAIN XN Seq Scan on lineorder_nsk (cost=0.00..12000757.76 rows=157 width=8) Filter: ((lo_custkey <= 2000) AND (lo_custkey >= 500) AND (lo_partkey <= 1000) AND (lo_partkey >= 500)) XN Seq Scan on lineorder_csk (cost=0.00..6017.71 rows=155 width=8) Filter: ((lo_custkey <= 2000) AND (lo_custkey >= 500) AND (lo_partkey <= 1000) AND (lo_partkey >= 500)) XN Seq Scan on lineorder_ilsk (cost=0.00..12000757.76 rows=151 width=8) Filter: ((lo_custkey <= 2000) AND (lo_custkey >= 500) AND (lo_partkey <= 1000) AND (lo_partkey >= 500))
# | No sort key | Compound sort key | Interleaved sort key |
---|---|---|---|
1 | 1.6s | 25ms | 24ms |
2 | 1.6s | 23ms | 23ms |
3 | 1.7s | 23ms | 22ms |
Cost | 0.00..12000757.76 | 0.00..6017.71 | 0.00..12000757.76 |
Both Compound sorting and Interleaved sorting are much faster than no sorting. Unlike GROUP BY, the Interleaved sort key made up of multiple columns still has good performance in the WHERE clause. The cost of Compound sorting is the lowest.
JOIN
-- Less than 10 million records table -- JOIN with the primary key SELECT s.dateid, s.eventid, s.salesid, d.caldate FROM sales s LEFT JOIN date d ON s.dateid = d.dateid; -- EXPLAIN XN Hash Left Join DS_DIST_NONE (cost=4.56..5609.38 rows=172456 width=14) Hash Cond: ("outer".dateid = "inner".dateid) -> XN Seq Scan on sales_nsk s (cost=0.00..1724.56 rows=172456 width=10) -> XN Hash (cost=3.65..3.65 rows=365 width=6) -> XN Seq Scan on date_nsk d (cost=0.00..3.65 rows=365 width=6) XN Merge Left Join DS_DIST_NONE (cost=0.00..3884.82 rows=172456 width=14) Merge Cond: ("outer".dateid = "inner".dateid) -> XN Seq Scan on sales_csk s (cost=0.00..1724.56 rows=172456 width=10) -> XN Seq Scan on date_csk d (cost=0.00..3.65 rows=365 width=6) XN Hash Left Join DS_BCAST_INNER (cost=4.56..58405609.38 rows=172456 width=14) Hash Cond: ("outer".dateid = "inner".dateid) -> XN Seq Scan on sales_csk2 s (cost=0.00..1724.56 rows=172456 width=10) -> XN Hash (cost=3.65..3.65 rows=365 width=6) -> XN Seq Scan on date_csk d (cost=0.00..3.65 rows=365 width=6) XN Hash Left Join DS_DIST_NONE (cost=4.56..5609.38 rows=172456 width=14) Hash Cond: ("outer".dateid = "inner".dateid) -> XN Seq Scan on sales_ilsk s (cost=0.00..1724.56 rows=172456 width=10) -> XN Hash (cost=3.65..3.65 rows=365 width=6) -> XN Seq Scan on date_ilsk d (cost=0.00..3.65 rows=365 width=6)
# | No sort key | Compound sort key | Compound sort key 2 | Interleaved sort key |
---|---|---|---|---|
1 | 681ms | 655ms | 687ms | 670ms |
2 | 682ms | 679ms | 685ms | 683ms |
3 | 665ms | 684ms | 677ms | 678ms |
Cost | 4.56..5609.38 | 0.00..3884.82 | 4.56.. 58405609.38 |
4.56..5609.38 |
Join Operaters | Hash Left Join | Merge Left Join | Hash Left Join | Hash Left Join |
Data Redistribution | DS_DIST_NONE | DS_DIST_NONE | DS_BCAST_INNER | DS_DIST_NONE |
Execution times are all the same, but the costs of each sort key are different. The cost of Compound sorting with Merge join as the Join Operator was the lowest, and that of Compound sorting 2 with DS_BCAST_INNER as the Data Redistribution explosively increased.
-- Less than 10 million records table -- JOIN with the secondary key SELECT s.dateid, s.eventid, s.salesid, e.eventname FROM sales s LEFT JOIN event e ON s.eventid = e.eventid; -- EXPLAIN XN Hash Left Join DS_BCAST_INNER (cost=109.98..2815365714.80 rows=172456 width=27) Hash Cond: ("outer".eventid = "inner".eventid) -> XN Seq Scan on sales_nsk s (cost=0.00..1724.56 rows=172456 width=10) -> XN Hash (cost=87.98..87.98 rows=8798 width=21) -> XN Seq Scan on event_nsk e (cost=0.00..87.98 rows=8798 width=21) XN Hash Left Join DS_BCAST_INNER (cost=109.98..2815365714.80 rows=172456 width=27) Hash Cond: ("outer".eventid = "inner".eventid) -> XN Seq Scan on sales_csk s (cost=0.00..1724.56 rows=172456 width=10) -> XN Hash (cost=87.98..87.98 rows=8798 width=21) -> XN Seq Scan on event_csk e (cost=0.00..87.98 rows=8798 width=21) XN Merge Left Join DS_DIST_NONE (cost=0.00..3990.24 rows=172456 width=27) Merge Cond: ("outer".eventid = "inner".eventid) -> XN Seq Scan on sales_csk2 s (cost=0.00..1724.56 rows=172456 width=10) -> XN Seq Scan on event_csk e (cost=0.00..87.98 rows=8798 width=21) XN Hash Left Join DS_BCAST_INNER (cost=109.98..2815365714.80 rows=172456 width=27) Hash Cond: ("outer".eventid = "inner".eventid) -> XN Seq Scan on sales_ilsk s (cost=0.00..1724.56 rows=172456 width=10) -> XN Hash (cost=87.98..87.98 rows=8798 width=21) -> XN Seq Scan on event_ilsk e (cost=0.00..87.98 rows=8798 width=21)
# | No sort key | Compound sort key | Compound sort key 2 | Interleaved sort key |
---|---|---|---|---|
1 | 699ms | 669ms | 676ms | 720ms |
2 | 752ms | 694ms | 689ms | 680ms |
3 | 691ms | 706ms | 696ms | 675ms |
Cost | 109.98.. 2815365714.80 |
109.98.. 2815365714.80 |
0.00..3990.24 | 109.98.. 2815365714.80 |
Join Operaters | Hash Left Join | Hash Left Join | Merge Left Join | Hash Left Join |
Data Redistribution | DS_BCAST_INNER | DS_BCAST_INNER | DS_DIST_NONE | DS_BCAST_INNER |
Execution times are all the same. The total cost of Compound sort key with Merge Join is the lowest.
-- Less than 10 million records table -- JOIN with the primary key and the secondary key SELECT s.dateid, s.eventid, s.salesid, e.eventname FROM sales s LEFT JOIN event e ON s.dateid = e.dateid AND s.eventid = e.eventid; -- EXPLAIN XN Hash Left Join DS_BCAST_INNER (cost=131.97..2815366172.66 rows=172456 width=27) Hash Cond: (("outer".eventid = "inner".eventid) AND ("outer".dateid = "inner".dateid)) -> XN Seq Scan on sales_nsk s (cost=0.00..1724.56 rows=172456 width=10) -> XN Hash (cost=87.98..87.98 rows=8798 width=23) -> XN Seq Scan on event_nsk e (cost=0.00..87.98 rows=8798 width=23) XN Hash Left Join DS_BCAST_INNER (cost=131.97..2815366172.66 rows=172456 width=27) Hash Cond: (("outer".eventid = "inner".eventid) AND ("outer".dateid = "inner".dateid)) -> XN Seq Scan on sales_csk s (cost=0.00..1724.56 rows=172456 width=10) -> XN Hash (cost=87.98..87.98 rows=8798 width=23) -> XN Seq Scan on event_csk e (cost=0.00..87.98 rows=8798 width=23) XN Merge Left Join DS_DIST_NONE (cost=0.00..2723.54 rows=172456 width=27) Merge Cond: (("outer".eventid = "inner".eventid) AND ("outer".dateid = "inner".dateid)) -> XN Seq Scan on sales_csk2 s (cost=0.00..1724.56 rows=172456 width=10) -> XN Seq Scan on event_csk e (cost=0.00..87.98 rows=8798 width=23) XN Hash Left Join DS_BCAST_INNER (cost=131.97..2815366172.66 rows=172456 width=27) Hash Cond: (("outer".eventid = "inner".eventid) AND ("outer".dateid = "inner".dateid)) -> XN Seq Scan on sales_ilsk s (cost=0.00..1724.56 rows=172456 width=10) -> XN Hash (cost=87.98..87.98 rows=8798 width=23) -> XN Seq Scan on event_ilsk e (cost=0.00..87.98 rows=8798 width=23)
# | No sort key | Compound sort key | Compound sort key 2 | Interleaved sort key |
---|---|---|---|---|
1 | 557ms | 575ms | 574ms | 568ms |
2 | 565ms | 577ms | 561ms | 572ms |
3 | 592ms | 603ms | 603ms | 570ms |
Cost | 131.97.. 2815366172.66 |
131.97.. 2815366172.66 |
0.00..2723.54 | 131.97.. 2815366172.66 |
Join Operaters | Hash Left Join | Hash Left Join | Merge Left Join | Hash Left Join |
Data Redistribution | DS_BCAST_INNER | DS_BCAST_INNER | DS_DIST_NONE | DS_BCAST_INNER |
The result is almost same as the previous one. Using both the primary key and the secondary key rather than only using the primary key decreases the cost.
-- More than 10 million records table -- JOIN with the primary key SELECT l.lo_custkey, l.lo_orderdate, c.c_name FROM lineorder l LEFT JOIN customer c ON l.lo_custkey = c.c_custkey LIMIT 10000; -- EXPLAIN XN Hash Left Join DS_DIST_NONE (cost=37500.00..37539868.00 rows=600037888 width=30) Hash Cond: ("outer".lo_custkey = "inner".c_custkey) -> XN Seq Scan on lineorder_nsk l (cost=0.00..6000378.88 rows=600037888 width=8) -> XN Hash (cost=30000.00..30000.00 rows=3000000 width=26) -> XN Seq Scan on customer_nsk c (cost=0.00..30000.00 rows=3000000 width=26) XN Merge Left Join DS_DIST_NONE (cost=0.00..13538352.48 rows=600037888 width=30) Merge Cond: ("outer".lo_custkey = "inner".c_custkey) -> XN Seq Scan on lineorder_csk l (cost=0.00..6000378.88 rows=600037888 width=8) -> XN Seq Scan on customer_csk c (cost=0.00..30000.00 rows=3000000 width=26) XN Hash Left Join DS_BCAST_INNER (cost=37500.00..1080037539868.00 rows=600037888 width=30) Hash Cond: ("outer".lo_custkey = "inner".c_custkey) -> XN Seq Scan on lineorder_csk2 l (cost=0.00..6000378.88 rows=600037888 width=8) -> XN Hash (cost=30000.00..30000.00 rows=3000000 width=26) -> XN Seq Scan on customer_csk c (cost=0.00..30000.00 rows=3000000 width=26) XN Hash Left Join DS_DIST_NONE (cost=37500.00..37539868.00 rows=600037888 width=30) Hash Cond: ("outer".lo_custkey = "inner".c_custkey) -> XN Seq Scan on lineorder_ilsk l (cost=0.00..6000378.88 rows=600037888 width=8) -> XN Hash (cost=30000.00..30000.00 rows=3000000 width=26) -> XN Seq Scan on customer_ilsk c (cost=0.00..30000.00 rows=3000000 width=26)
# | No sort key | Compound sort key | Compound sort key 2 | Interleaved sort key |
---|---|---|---|---|
1 | 203ms | 88ms | 594ms | 211ms |
2 | 221ms | 105ms | 567ms | 218ms |
3 | 208ms | 97ms | 628ms | 205ms |
Cost | 37500.00.. 37539868.00 |
0.00.. 13538352.48 |
37500.00.. 1080037539868.00 |
37500.00.. 37539868.00 |
Join Operaters | Hash Left Join | Merge Left Join | Hash Left Join | Hash Left Join |
Data Redistribution | DS_DIST_NONE | DS_DIST_NONE | DS_BCAST_INNER | DS_DIST_NONE |
The differences between these sort keys are more obvious when the number of records is over 10 million. Compound sorting was the fastest, and Compound sorting 2 with DS_BCAST_INNER as the Data Redistribution got more than three times as slow as the others.
-- More than 10 million records table -- JOIN with the secondary key SELECT l.lo_custkey, l.lo_orderdate, p.p_name FROM lineorder l LEFT JOIN part p ON l.lo_partkey = p.p_partkey LIMIT 10000; -- EXPLAIN XN Hash Left Join DS_BCAST_INNER (cost=17500.00..392025519110.24 rows=600037888 width=24) Hash Cond: ("outer".lo_partkey = "inner".p_partkey) -> XN Seq Scan on lineorder_nsk l (cost=0.00..6000378.88 rows=600037888 width=12) -> XN Hash (cost=14000.00..14000.00 rows=1400000 width=20) -> XN Seq Scan on part_nsk p (cost=0.00..14000.00 rows=1400000 width=20) XN Hash Left Join DS_BCAST_INNER (cost=17500.00..392025519110.24 rows=600037888 width=24) Hash Cond: ("outer".lo_partkey = "inner".p_partkey) -> XN Seq Scan on lineorder_csk l (cost=0.00..6000378.88 rows=600037888 width=12) -> XN Hash (cost=14000.00..14000.00 rows=1400000 width=20) -> XN Seq Scan on part_csk p (cost=0.00..14000.00 rows=1400000 width=20) XN Merge Left Join DS_DIST_NONE (cost=0.00..13513363.88 rows=600037888 width=24) Merge Cond: ("outer".lo_partkey = "inner".p_partkey) -> XN Seq Scan on lineorder_csk2 l (cost=0.00..6000378.88 rows=600037888 width=12) -> XN Seq Scan on part_csk p (cost=0.00..14000.00 rows=1400000 width=20) XN Hash Left Join DS_BCAST_INNER (cost=17500.00..392025519110.24 rows=600037888 width=24) Hash Cond: ("outer".lo_partkey = "inner".p_partkey) -> XN Seq Scan on lineorder_ilsk l (cost=0.00..6000378.88 rows=600037888 width=12) -> XN Hash (cost=14000.00..14000.00 rows=1400000 width=20) -> XN Seq Scan on part_ilsk p (cost=0.00..14000.00 rows=1400000 width=20)
# | No sort key | Compound sort key | Compound sort key 2 | Interleaved sort key |
---|---|---|---|---|
1 | 338ms | 355ms | 72ms | 323ms |
2 | 294ms | 320ms | 79ms | 398ms |
3 | 297ms | 332ms | 74ms | 348ms |
Cost | 17500.00.. 392025519110.24 |
17500.00.. 392025519110.24 |
0.00.. 13513363.88 |
17500.00.. 392025519110.24 |
Join Operaters | Hash Left Join | Hash Left Join | Merge Left Join | Hash Left Join |
Data Redistribution | DS_BCAST_INNER | DS_BCAST_INNER | DS_DIST_NONE | DS_BCAST_INNER |
The performance of Compound sorting with Merge join is by far the best.
-- More than 10 million records table -- JOIN with the primary key and the secondary key SELECT l.lo_custkey, l.lo_orderdate FROM lineorder l LEFT JOIN lineorder l2 ON l.lo_custkey = l2.lo_custkey AND l.lo_partkey = l2.lo_partkey LIMIT 10000; -- EXPLAIN XN Hash Left Join DS_DIST_NONE (cost=9000568.32..7230458679.56 rows=600037888 width=8) Hash Cond: (("outer".lo_custkey = "inner".lo_custkey) AND ("outer".lo_partkey = "inner".lo_partkey)) -> XN Seq Scan on lineorder_nsk l (cost=0.00..6000378.88 rows=600037888 width=12) -> XN Hash (cost=6000378.88..6000378.88 rows=600037888 width=8) -> XN Seq Scan on lineorder_nsk l2 (cost=0.00..6000378.88 rows=600037888 width=8) XN Merge Left Join DS_DIST_NONE (cost=0.00..18003089.13 rows=600037888 width=8) Merge Cond: (("outer".lo_custkey = "inner".lo_custkey) AND ("outer".lo_partkey = "inner".lo_partkey)) -> XN Seq Scan on lineorder_csk l (cost=0.00..6000378.88 rows=600037888 width=12) -> XN Seq Scan on lineorder_csk l2 (cost=0.00..6000378.88 rows=600037888 width=8) XN Hash Left Join DS_DIST_OUTER (cost=9000568.32..60011019258465.38 rows=600037888 width=8) Outer Dist Key: l.lo_custkey Hash Cond: (("outer".lo_custkey = "inner".lo_custkey) AND ("outer".lo_partkey = "inner".lo_partkey)) -> XN Seq Scan on lineorder_csk2 l (cost=0.00..6000378.88 rows=600037888 width=12) -> XN Hash (cost=6000378.88..6000378.88 rows=600037888 width=8) -> XN Seq Scan on lineorder_csk l2 (cost=0.00..6000378.88 rows=600037888 width=8) XN Hash Left Join DS_DIST_NONE (cost=9000568.32..7230458334.30 rows=600037888 width=8) Hash Cond: (("outer".lo_custkey = "inner".lo_custkey) AND ("outer".lo_partkey = "inner".lo_partkey)) -> XN Seq Scan on lineorder_ilsk l (cost=0.00..6000378.88 rows=600037888 width=12) -> XN Hash (cost=6000378.88..6000378.88 rows=600037888 width=8) -> XN Seq Scan on lineorder_ilsk l2 (cost=0.00..6000378.88 rows=600037888 width=8)
# | No sort key | Compound sort key | Compound sort key 2 | Interleaved sort key |
---|---|---|---|---|
1 | 27.0s | 90ms | 44.9s | 25.8s |
2 | 27.1s | 77ms | 44.7s | 25.8s |
3 | 26.9s | 69ms | 44.7s | 25.9s |
Cost | 9000568.32.. 7230458679.56 |
0.00.. 18003089.13 |
9000568.32.. 60011019258465.38 |
9000568.32.. 7230458334.30 |
Join Operaters | Hash Left Join | Merge Left Join | Hash Left Join | Hash Left Join |
Data Redistribution | DS_DIST_NONE | DS_DIST_NONE | DS_DIST_OUTER | DS_DIST_NONE |
I didn't have a just right joining table for lineorder, so I used the same table for joining. As the result of this, Redshift came to handle over 16 billion records, but the performance of Compound sorting was still really good. (That was amazing!) Compound sorting with DS_DIST_OUTER as the Data Redistribution is more than twice as slow as the others.
Therefore, if you can correctly deal with Compound sort keys, you can achieve maximum benefits for the performance of Redshift, otherwise the performance gets worse than Interleaved sorting. Interleaved sorting becomes an easy way out for the sort key design if the tables are really huge and complex.
When to achieve Merge Join
-- More than 10 million records table -- JOIN with the primary key SELECT l.lo_custkey, l.lo_orderdate, c.c_name FROM lineorder_csk l LEFT JOIN customer_csk c ON l.lo_custkey = c.c_custkey LIMIT 10000; SELECT l.lo_custkey, l.lo_orderdate, c.c_name FROM lineorder_csk l LEFT JOIN (SELECT c_custkey, c_name FROM customer_csk ) c ON l.lo_custkey = c.c_custkey LIMIT 10000; SELECT l.lo_custkey, l.lo_orderdate, c.c_name FROM lineorder_csk l LEFT JOIN customer_csk c ON l.lo_custkey::text = c.c_custkey::text LIMIT 10000; -- EXPLAIN XN Merge Left Join DS_DIST_NONE (cost=0.00..13538352.48 rows=600037888 width=30) Merge Cond: ("outer".lo_custkey = "inner".c_custkey) -> XN Seq Scan on lineorder_csk l (cost=0.00..6000378.88 rows=600037888 width=8) -> XN Seq Scan on customer_csk c (cost=0.00..30000.00 rows=3000000 width=26) XN Merge Left Join DS_DIST_NONE (cost=0.00..13538352.48 rows=600037888 width=30) Merge Cond: ("outer".lo_custkey = "inner".c_custkey) -> XN Seq Scan on lineorder_csk l (cost=0.00..6000378.88 rows=600037888 width=8) -> XN Seq Scan on customer_csk (cost=0.00..30000.00 rows=3000000 width=26) XN Hash Left Join DS_BCAST_INNER (cost=37500.00..6570354213173.60 rows=9000568320000 width=30) Hash Cond: ((("outer".lo_custkey)::character varying)::text = (("inner".c_custkey)::character varying)::text) -> XN Seq Scan on lineorder_csk l (cost=0.00..6000378.88 rows=600037888 width=8) -> XN Hash (cost=30000.00..30000.00 rows=3000000 width=26) -> XN Seq Scan on customer_csk c (cost=0.00..30000.00 rows=3000000 width=26)
# | As-is | Subquery | Convert the datatype |
---|---|---|---|
1 | 62ms | 82ms | 1.2s |
2 | 99ms | 81ms | 1.2s |
3 | 80ms | 81ms | 1.2s |
cost | 0.00..13538352.48 | 0.00..13538352.48 | 37500.00..6570354213173.60 |
Join Operaters | Merge Left Join | Merge Left Join | Hash Left Join |
Merge Join is achieved in a subquery, but when you changed the column's datatype, it's not achieved.
References
- Query Plan - Amazon Redshift
- The LOOP, HASH and MERGE Join Types
- VACUUM - Amazon Redshift
- Deciding Whether to Reindex - Amazon Redshift
- Choosing Sort Keys - Amazon Redshift
- Comparing Sort Styles - Amazon Redshift
- Choose the Best Sort Key - Amazon Redshift
- Amazon Redshift Engineering’s Advanced Table Design Playbook: Distribution Styles and Distribution Keys | AWS Big Data Blog
- Amazon Redshift Engineering’s Advanced Table Design Playbook: Compound and Interleaved Sort Keys | AWS Big Data Blog
- Performing a Deep Copy - Amazon Redshift
- EXPLAIN - Amazon Redshift
- Managing the Size of the Unsorted Region - Amazon Redshift
- SVV_DISKUSAGE - Amazon Redshift
- SVL_QLOG - Amazon Redshift
- Step 1: Create a Test Data Set - Amazon Redshift
- Step 6: Load Sample Data from Amazon S3 - Amazon Redshift